22 research outputs found
An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor
Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations with 10 implemented on an SoC-FPGA and the remaining on a cycle-accurate simulator. Across 12 OpenCL benchmarks results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and on one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature
An efficient multiple precision floating-point Multiply-Add Fused unit
Multiply-Add Fused (MAF) units play a key role in the processor's performance for a variety of applications. The objective of this paper is to present a multi-functional, multiple precision floating-point Multiply-Add Fused (MAF) unit. The proposed MAF is reconfigurable and able to execute a quadruple precision MAF instruction, or two double precision instructions, or four single precision instructions in parallel. The MAF architecture features a dual-path organization reducing the latency of the floating-point add (FADD) instruction and utilizes the minimum number of operating components to keep the area low. The proposed MAF design was implemented on a 65 nm silicon process achieving a maximum operating frequency of 293.5 MHz at 381 mW power
VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads
We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor
based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2
performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility,
hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support
System on fabrics architecture using distributed computing
This paper describes a novel, distributed sensor
network with parallel processing capability based on the Instruction Systolic Array (ISA). A new computing paradigm is introduced where spatially distributed sensors are closely coupled to processing elements and the whole array forms a parallel computer. This may find applications in wearable devices for sensing the position and other metrics of a human body and
rapidly processing that data. A new programming model to implement the distributed computer on fabrics is described. The fabric-based distributed computing concept has been validated using a number of parallel applications including a real-time shape sensing and reconstruction application. The exemplar wearable system based on the ISA concept has been realized using
off-the-shelf microcontrollers and sensors. Results show that the application executes on the prototype ISA implementation in real time thus confirming the viability of the proposed architecture for fabric-resident computing devices
Parametric data-parallel architectures for TLM acceleration
We discuss the architecture and
microarchitecture of a scalable, parametric vector
accelerator for the TLM algorithm. Architecture-level
experimentation demonstrates an order of
magnitude complexity reduction for vector
lengths of 16 32-bit single-precision elements. We
envisage the proposed architecture replicated in a
SOC environment thus, forming a multiprocessor
system capable of tapping parallelism at the
thread level as well as the data level
Shape reconstruction using instruction systolic array
This paper describes a novel, 2D mesh architecture prototype based on the Instruction Systolic Array (ISA) paradigm for distributed computing on fabrics. We discuss a real-time shape sensing and
reconstruction application executing on this architecture and demonstrate a physical design for a wearable system based on ISA concept constructed out of off-the-shelf microcontrollers and sensors. Results demonstrate the application executes in 39 ms on our prototype ISA implementation thus confirming the viability of the proposed architecture for fabric-resident computing devices
Feasibility of imaging photoplethysmography
Contact and spot measurement have limited the application of photoplethysmography (PPG), thus an imaging PPG system comprising a digital CMOS camera and three wavelength light-emitting diodes (LEDs) is developed to detect the blood perfusion in tissue. With the means of the imaging PPG system, the ideally contactless monitoring with larger field of view and the different depth of tissue by applying multi- wavelength illumination can be achieved to understand the blood perfusion change. Corresponding to the individual wavelength LED illumination, the PPG signals can be derived in the both transmission and reflection modes, respectively. The outcome explicitly reveals the imaging PPG is able to detect blood perfusion in a illuminated tissue and indicates the vascular distribution and the blood cell response to individual wavelength LED. The functionality investigation leads to the engineering model for 3-D visualized blood perfusion of tissue and the development of imaging PPG tomography
Efficient protection of the pipeline core for safety-critical processor-based systems
The increasing number of safety-critical commercial
applications has generated a need for components with high
levels of reliability. As CMOS process sizes continue to shrink,
the reliability of ICs is negatively affected since they become
more sensitive to transient faults. New circuit designs must take
this fact into consideration, and incorporate adequate protection
against the effects of transient faults. This paper presents a
novel method for protecting the pipelined execution unit of an
embedded processor. It is based on a self-configured architecture
with hybrid redundancy that can mask single and multiple
errors, which can occur on storage elements due to transient
or permanent faults. This concept can be easily applied to any
processing architecture of this nature with a high safety integrity
level. Results from error-injection experiments are also reported
that show that this design can maintain a non-interrupted and
failure-free operation under single and double errors with a
probability that exceeds 99.4%
Study of the effects of SEU-induced faults on a pipeline protected microprocessor
This paper presents a detailed analysis of the behavior of a novel fault-tolerant 32-bit embedded CPU as compared to a
default (non-fault-tolerant) implementation of the same processor during a fault injection campaign of single and double faults. The
fault-tolerant processor tested is characterized by per-cycle voting of microarchitectural and the flop-based architectural states,
redundancy at the pipeline level, and a distributed voting scheme. Its fault-tolerant behavior is characterized for three different
workloads from the automotive application domain. The study proposes statistical methods for both the single and dual fault injection
campaigns and demonstrates the fault-tolerant capability of both processors in terms of fault latencies, the probability of fault
manifestation, and the behavior of latent faults
A system-on-chip vector multiprocessor for transmission line modelling acceleration
We discuss a configurable, System-on-Chip vector
multiprocessor for accelerating the Transmission Line
Modeling (TLM) algorithm with an architecture capable of
exploiting the two primary forms of parallelism in the code,
thread and data level parallelism. Theoretical results
demonstrate an order of magnitude reduction in the dynamic
instruction count for a scalar-processor/vector-coprocessor
configuration at a vector length of sixteen 32-bit singleprecision
elements. Furthermore, a multi-vector SoC
architecture consisting of ten such vector accelerators provides
a near-linear theoretical performance benefit of the order of
88% in three out of four benchmark configurations which is
orthogonal to the benefit realized by vectorization alone. We
discuss in detail this potent architecture and present
implementation data for the 2-way multi-processor VLSI
macrocell